ABSTRACT



A system that provides for the graphic visualization of the 
categories of a collection of records. The graphic visualiza- 
tion is referred to as "category graph." The system option- 
ally displays the category graph as a "similarity graph" or a 
"hierarchical map." When displaying a category graph, the 
system displays a graphic representation of each category. 
The system displays the category graph as a similarity graph 
or a hierarchical map in a way that visually illustrates the 
similarity between categories. The display of a category 
graph allows a data analyst to better understand the
similarity and dissimilarity between categories. A similarity
graph includes a node for each category and an arc connect- 
ing nodes representing categories whose similarity is above 
a threshold. A hierarchical map is a tree structure that 
includes a node for each base category along with nodes 
representing combinations of similar categories. 
METHOD AND SYSTEM FOR
VISUALIZATION OF CLUSTERS AND 
CLASSIFICATIONS 

TECHNICAL FIELD

This invention relates generally to user interfaces and, 
more specifically, to user interfaces for visualization of 
categories of data. 

BACKGROUND OF THE INVENTION 

Computer systems have long been used for data analysis. 
For example, the data may include the demographics of 
users and web pages accessed by users. A web master (i.e., 
a manager of a web site) may desire to review the web page 
access patterns of the users in order to optimize the links 
between the various web pages or to customize advertisements
to the demographics of the users. However, it may be
very difficult for the web master to analyze the access
patterns of thousands of users involving possibly hundreds
of web pages. The difficulty in the analysis may be
lessened if the users can be categorized by common demo- 
graphics and common web page access patterns. Two tech- 
niques of data categorization — classification and 
clustering — can be useful when analyzing large amounts of 
such data. These categorization techniques are used to 
categorize data represented as a collection of records con- 
taining values for various attributes. For example, each 
record may represent a user, and the attributes describe 
various characteristics of the user. The characteristics may 
include the sex, income, and age of the user, or web pages 
accessed by the user. FIG. 1A illustrates a collection of
records as a table. Each record (1, 2, . . . , n) contains a value
for each of the attributes (1, 2, . . . , m). For example, attribute
4 may represent the age of a user and attribute 3 may indicate 
whether the user has accessed a certain web page. Therefore, 
the user represented by record 2 accessed the web page as 
represented by attribute 3 and is age 36 as represented by 
attribute 4. 

Classification techniques allow a data analyst (e.g., web 
master) to group the records of a collection into classes. That 
is, the data analyst reviews the attributes of each record, 
identifies classes, and then assigns each record to a class. 
FIG. 1B illustrates the results of the classification of a
collection. The data analyst has identified three classes: A, 
B, and C. In this example, records 1 and n have been 
assigned to class A; record 2 has been assigned to class B, 
and records 3 and n-1 have been assigned to class C. Thus, 
the data analyst determined that the attributes for rows 1 and 
n are similar enough to be in the same class. In this example, 
a record can only be in one class. However, certain records 
may have attributes that are similar to more than one class. 
Therefore, some classification techniques, and more gener- 
ally some categorization techniques, assign a probability 
that each record is in each class. For example, record 1 may 
have a probability of 0.75 of being in class A, a probability 
of 0.1 of being in class B, and a probability of 0.15 of being 
in class C. Once the data analyst has classified the records, 
standard classification techniques can be applied to create a 
classification rule that can be used to automatically classify 
new records as they are added to the collection (e.g., Duda,
R., and Hart, P., Pattern Classification and Scene Analysis,
Wiley, 1973). FIG. 1C illustrates the automatic classification
of record n+1 when it is added to the collection. In this
example, the new record was automatically assigned to class 
B. 

Clustering techniques provide an automated process for
analyzing the records of the collection and identifying
clusters of records that have similar attributes. For example,
a data analyst may request a clustering system to cluster the
records into five clusters. The clustering system would then
identify which records are most similar and place them into
one of the five clusters (e.g., Duda and Hart). Also, some
clustering systems automatically determine the number of
clusters. FIG. 1D illustrates the results of the clustering of a
collection. In this example, records 1, 2, and n have been
assigned to cluster A, and records 3 and n-1 have been
assigned to cluster B. Note that in this example the values
stored in the column marked "cluster" in FIG. 1D have been
determined by the clustering algorithm.

Once the categories (e.g., classes and clusters) are
established, the data analyst can use the attributes of the
categories to guide decisions. For example, if one category
represents users who are mostly teenagers, then a web
master may decide to include advertisements directed to
teenagers in the web pages that are accessed by users in this
category. However, the web master may not want to include
advertisements directed to teenagers on a certain web page
if users in a different category who are senior citizens also
happen to access that web page frequently. Even though the
categorization of the collection may reduce the amount of
data that a data analyst needs to review from thousands of
records to possibly 10 or 20 categories, the data analyst still
needs to understand the similarity and dissimilarity of the
records in the categories so that appropriate decisions can be
made.

SUMMARY OF THE INVENTION

An embodiment of the present invention provides a 
category visualization ("CV") system that presents a graphic 
display of the categories of a collection of records referred
to as a "category graph." The CV system may optionally
display the category graph as a "similarity graph" or a
"hierarchical map." When displaying a category graph, the
CV system displays a graphic representation of each category.
The CV system displays the category graph as a
similarity graph or a hierarchical map in a way that visually
illustrates the similarity between categories. The display of
a category graph allows a data analyst to better understand 
the similarity and dissimilarity between categories. A simi- 
larity graph includes a node for each category and an arc 
connecting nodes representing categories whose similarity is 
above a threshold. A hierarchical map is a tree structure that 
includes a node for each base category along with nodes 
representing combinations of similar categories. 

In another aspect of the present invention, the CV system 
calculates and displays various characteristic and discrimi- 
nating information about the categories. In particular, the 
CV system displays information describing the attributes of 
a category that best discriminate the records of that category 
from another category. The CV system also displays infor- 
mation describing the attributes that are most characteristic 
of a category. 

BRIEF DESCRIPTION OF THE DRAWINGS 
FIG. 1A illustrates a collection of records as a table.

FIG. 1B illustrates the results of the classification of a
collection.

FIG. 1C illustrates the automatic classification of record
n+1 when it is added to the collection.

FIG. 1D illustrates the results of the clustering of a
collection.

FIGS. 2A-2F illustrate example displays of a similarity
network.

FIGS. 3A-3K illustrate example displays of a hierarchical
map in a tree format and the support provided for traversing 
this map and examining it. 

FIG. 4 illustrates an example display of a hierarchical map 
in a circular format. 

FIG. 5A illustrates characteristics pages of a category of 
users of web pages. 

FIG. 5B illustrates discriminating pages for a category of
users of web pages. 

FIG. 5C illustrates pairwise discrimination for two categories
of users of web pages.

FIGS. 6A-6B illustrate a 3-D graph of the probability that 
each attribute equals 1 for binary attributes for various
clusters. 

FIG. 7 illustrates a decision tree format for displaying the 
categories of a collection. 

FIG. 8 illustrates the components of an embodiment of the 
category visualization system. 

FIG. 9 is a flow diagram of a routine for calculating the 
similarity of base categories. 

FIG. 10 is a flow diagram of a routine for displaying a 
similarity graph. 

FIG. 11 is a flow diagram of a routine for generating a
hierarchical map. 

FIG. 12 is a flow diagram of a routine to display a
hierarchical map. 



DETAILED DESCRIPTION OF THE 
INVENTION

An embodiment of the present invention provides a
category visualization ("CV") system that presents a graphic
display of the categories of a collection of records referred
to as a "category graph." The CV system may optionally
display the category graph as a "similarity graph" or a
"hierarchical map." When displaying a category graph, the
CV system displays a graphic representation of each category.
The CV system displays the category graph as a
similarity graph or a hierarchical map in a way that visually
illustrates the similarity between categories. The display of
a category graph allows a data analyst to better understand
the similarity and dissimilarity between categories. 

A similarity graph includes a node for each category and
an arc connecting nodes representing categories that are
similar. The CV system allows the data analyst to select a
similarity threshold and then displays arcs between nodes
representing pairs of categories whose similarity is above
the similarity threshold. Similarity is a rating of how similar
the records of one category are to the records of another
category. A mathematical basis for similarity is provided
below. As a data analyst changes the similarity threshold, the
CV system adds and removes arcs between the nodes based
on the decrease or increase of the similarity threshold. The
CV system also allows the data analyst to combine categories
that are most similar and to split a combined category
into its sub-categories. The CV system updates the display
of the similarity graph to reflect the combining and splitting
of categories.
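
The threshold behavior can be illustrated with a minimal Python sketch. It is not the CV system's implementation: the function names, the dictionary representation, and the similarity values are assumptions made for illustration, and the similarity measure itself is taken as given.

```python
def arcs_above_threshold(similarity, threshold):
    """Return the arcs to draw: category pairs whose similarity exceeds the threshold.

    `similarity` maps a frozenset of two category names to a similarity value.
    """
    return {pair for pair, value in similarity.items() if value > threshold}


def arc_changes(similarity, old_threshold, new_threshold):
    """Return (arcs_to_add, arcs_to_remove) when the threshold changes."""
    before = arcs_above_threshold(similarity, old_threshold)
    after = arcs_above_threshold(similarity, new_threshold)
    return after - before, before - after


# Hypothetical similarity values, chosen only for illustration.
SIMILARITY = {
    frozenset({"ie support", "windows support"}): 0.9,
    frozenset({"ie support", "enterprise"}): 0.5,
    frozenset({"enterprise", "windows support"}): 0.2,
}
```

With these assumed values, lowering the threshold from 0.6 to 0.4 adds the arc between "ie support" and "enterprise", and raising it back removes that arc again.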

A hierarchical map includes a node for each base category
along with nodes representing combinations of similar categories.
A base category is a category identified by a
categorization process (e.g., classification and clustering),
whereas a combined category has been assigned the records
of two or more base categories. A leaf node representing
each base category forms the bottom of the hierarchy, and a
root node representing a category that contains all the
records in the collection forms the top of the hierarchy. Each
non-leaf node represents a combined category. Each non-leaf
node has two arcs that connect the non-leaf node to the
two nodes representing the sub-categories of the combined
category represented by the non-leaf node. To form the
hierarchy, the CV system starts with the base categories and
combines the two base categories that are most similar to
form a combined category. The CV system then combines
the two categories (including combined categories, but not
including any category that has already been combined) that
are most similar. The CV system repeats this process until
one combined category represents all the records in the
collection.
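
The merging procedure can be sketched in Python as follows. This is a sketch under stated assumptions rather than the routine of FIG. 11: the `Node` class, the `build_hierarchy` function, and the toy `mean_age_similarity` measure are hypothetical, and any similarity function over two lists of records can be supplied in its place. At least one base category is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Node:
    name: str
    records: list
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_hierarchy(
    base: Sequence[Node], similarity: Callable[[list, list], float]
) -> Tuple[Node, List[Node]]:
    """Repeatedly combine the two most similar uncombined categories.

    Returns the root node and the combined categories in generation order.
    """
    active = list(base)  # categories that have not yet been combined
    merge_order = []
    while len(active) > 1:
        pairs = ((i, j) for i in range(len(active)) for j in range(i + 1, len(active)))
        i, j = max(pairs, key=lambda p: similarity(active[p[0]].records, active[p[1]].records))
        a, b = active[i], active[j]
        merged = Node(f"({a.name} + {b.name})", a.records + b.records, a, b)
        active = [n for k, n in enumerate(active) if k not in (i, j)] + [merged]
        merge_order.append(merged)
    return active[0], merge_order


def mean_age_similarity(a, b):
    """Toy measure (an assumption): closer mean ages give higher similarity."""
    return -abs(sum(a) / len(a) - sum(b) / len(b))
```

The returned generation order is the same information that bottom-up and top-down browsing of the map rely on.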

The CV system allows a data analyst to interact with a 
category graph to obtain further information relating to the 
categories. In response to a data analyst selecting a displayed 
graphic representation, the CV system displays additional
information about the represented category. For example, 
the CV system may display the number of records in the 
category or characteristic attributes of the category. In 
response to a data analyst selecting a displayed arc, the CV 
system displays information relating to the categories con- 
nected by the arc. For example, if the data analyst selects an 
arc in a similarity network, then the CV system may display 
the similarity value for the two categories represented by the 
nodes that the selected arc connects. The CV system also 
allows the user to de-emphasize (e.g., hide) the nodes 
representing certain categories so that data analysts may 
focus their attention on the other non-de-emphasized cat- 
egories. 

Although a mathematical basis for similarity is provided 
below in detail, similarity can be defined in many different 
ways. Conceptually, similarity refers to a rating of the
differences between the attribute values of the records in one 
category and the attribute values of the records in another 
category. A low value for similarity indicates that there is 
little difference between the records in the two categories. 

FIGS. 2A-2F illustrate example displays of a similarity 
network. The similarity network illustrates the similarity
between ten categories, which have been named based on 
web page access attributes. Table 1 lists the names of the 
categories and numbers of records in each category. 

TABLE 1

Category Name        Number of Records

broad                18
web tools            15789
developer            6632
advanced office      3868
office               12085
ie                   22621
enterprise           10162
office support       9516
ie support           6687
windows support      12618
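
Record counts such as those in Table 1 can be expressed relative to the largest category or to the whole collection, which is how the shaded portion of a node is sized. The Python sketch below is illustrative only; the function name `shading_fractions` and its `relative_to` parameter are assumptions, not part of the CV system.

```python
# Record counts copied from Table 1.
TABLE_1 = {
    "broad": 18,
    "web tools": 15789,
    "developer": 6632,
    "advanced office": 3868,
    "office": 12085,
    "ie": 22621,
    "enterprise": 10162,
    "office support": 9516,
    "ie support": 6687,
    "windows support": 12618,
}


def shading_fractions(counts, relative_to="largest"):
    """Return the fraction of each node to shade.

    relative_to="largest": size relative to the category with the most records.
    relative_to="total":   size relative to all records in the collection.
    """
    if relative_to == "largest":
        denominator = max(counts.values())
    elif relative_to == "total":
        denominator = sum(counts.values())
    else:
        raise ValueError("relative_to must be 'largest' or 'total'")
    return {name: count / denominator for name, count in counts.items()}
```

Relative to the largest category, "ie" is fully shaded and "windows support" comes out at roughly 0.56, which matches the "approximately one-half" shading described for the similarity graph.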



Window 200 contains a display area 201 and a slider 202. 
The similarity network 220 within the display area contains 
a node for each category and an arc for each pair of 
categories whose similarity is above the similarity threshold. 
For example, node 203 representing category "ie support" 
and node 204 representing category "windows support" 
have a similarity that is above the similarity threshold and 
are thus connected by arc 206. However, the similarity 
between category "ie support" and category "enterprise" is
below the similarity threshold. Therefore, the similarity
network has no arc between node 205 representing category
"enterprise" and node 203 representing category "ie support."

The shading within the nodes of the similarity graph
indicates the size (i.e., number of records) of the category that
the node represents relative to the category with the greatest
number of records. Since category "ie" contains more
records than any other category, the CV system shades the
entire node representing category "ie." Since category
"windows support" has a number of records that is approximately
one-half the number of records in category "ie," the CV
system shades approximately one-half of the node representing
category "windows support." Alternatively, the
shading of the nodes can represent the number of records in
the category in relation to a total number of records in the
collection. In such a case, the CV system would shade
approximately 10% of the node representing a category that
contains 10% of the records of the collection. The nodes of
a category graph can also have various graphic shapes. The
nodes of the similarity graph in this example are displayed
as an oval containing the name of the category that the node
represents. Alternatively, the nodes may be any shape such
as a circle or a rectangle. FIG. 2B illustrates a sample
rectangular node. The node contains the name of the category
and the number of records in the category. The node
also contains a shaded portion indicating the proportion
of the number of records in that category to the total number
of records in the collection. In an alternative embodiment,
the node might also display other statistical information such
as the average value of an attribute (e.g., age) for records in
the category or the mode of an attribute (e.g., color).

The CV system provides the vertical slider 202, which
alternatively may be displayed as a horizontal slider, to
allow the data analyst to set the similarity threshold. As the
data analyst moves the slider up and down, the similarity
threshold increases or decreases. FIG. 2C illustrates the
example similarity graph after the data analyst has decreased
the similarity threshold by moving the slider down. In this
example, the similarity between category "enterprise" and
category "ie support" is now greater than the similarity
threshold. Thus, the CV system displays an arc 207 between
node 205 representing category "enterprise" and node 203
representing category "ie support." If the data analyst then
increases the similarity threshold by moving the slider to
where it was previously positioned, then the CV system
would remove arc 207.

Although the arcs of FIG. 2C indicate categories whose
similarity is above the similarity threshold, the arcs do not
indicate relative similarity between categories. FIG. 2D
illustrates the example similarity graph indicating relative
similarity. The CV system indicates the relative similarity of
two categories by the thickness of the arc connecting the
nodes. That is, the CV system displays a thick arc to connect
nodes representing categories that are similar, and displays
a thin arc to connect nodes representing categories that are
not similar. In this example, since category "ie support" and
category "windows support" are the most similar categories,
the CV system has drawn the arc 206 connecting the node
203 representing category "ie support" and node 204 representing
category "windows support" as the thickest. The
CV system may alternatively use various graphic representations
as indications of similarity between categories. For
example, the proximity of the nodes to one another may
indicate the similarity. That is, nodes that are displayed
closest to each other are most similar. Also, the similarity of
nodes may be indicated by the color of the arcs. For
example, a green arc may indicate a high degree of
similarity, whereas a red arc may indicate a low degree of
similarity.

The CV system allows the data analyst to control the
combining and splitting of categories. In particular, the CV
system allows the data analyst to combine categories that are
most similar and to split categories that have been combined.
The combining and splitting of categories allows the data
analyst to focus on more or fewer categories as needed. FIG.
2E illustrates the combining of the most similar categories.
The slider 202 may be used to control the combining and
splitting of categories. As the user moves the slider up a
notch, the CV system selects the two categories represented
by displayed nodes that are most similar and combines those
categories into a single category. The CV system then
removes the node for each of the categories to be combined
along with arcs connected to those categories and displays a
new node representing the combined category. In this
example, categories "ie support" and "windows support" are
most similar. Therefore, nodes 203 and 204 and arcs connected
to those nodes have been removed and node 210
representing the combined category "ie and windows support"
has been added. As the user moves the slider down a
notch, the CV system splits the categories that were last
combined. Thus, when the slider is moved down a notch
after being moved up a notch, then the CV system displays
the same similarity graph that was displayed before the data
analyst moved the slider. The CV system may animate the
combining and splitting of categories. That is, the CV
system shows the two nodes representing categories to be
combined moving towards each other to form a single node
representing the combined categories. The CV system animates
the splitting of nodes by showing the reverse process.

To further help a data analyst to focus on certain
categories, the CV system allows a data analyst to
de-emphasize a category. FIG. 2F illustrates the
de-emphasizing of categories. When the data analyst specifies
to de-emphasize a category, the CV system either
removes the node representing that category and all connecting
arcs from the similarity graph or displays that node
and connecting arcs in a dimmed manner. For example, if the
data analyst specifies to de-emphasize category "windows
support," then the CV system removes node 204 representing
category "windows support" and connecting arcs 206
and 212.

FIGS. 3A-3K and 4A-4B illustrate the display of a
hierarchical map. The CV system creates a hierarchical map
by starting with the base categories and successively combining
the most similar categories to generate combined
categories until a single combined category contains all the
records of the collection. The construction of the hierarchy
can be guided by an automated procedure (e.g., as described
herein), by direct input from a user providing guidance as to
which nodes should be merged or split next, or by a
combination of both using occasional user interaction. The
hierarchical map can be displayed in either tree format or
circular format. With tree format selected, the CV system
displays the hierarchical map in a standard tree data structure
layout with the root node at the top of the display and
the leaf nodes at the bottom of the display. Alternatively, the
CV system may display the tree data structure upside-down
with the root node at the bottom of the display and leaf nodes
at the top of the display or sideways with the root node at one
side of the display and the leaf nodes at the other side of the
display. With circular format selected, the CV system displays
the hierarchical map in a circular layout with the leaf
nodes at the perimeter of a circle and the root node at the
center. FIGS. 3A-3K illustrate the display of a hierarchical
map in a tree format. FIG. 3A illustrates the display of a 
hierarchical map in the tree format with leaf nodes horizontally
aligned. The hierarchical map 300 contains a leaf node
301-310 for each base category. The non-leaf nodes represent
combined categories. For example, node 311 represents
a combined category "support" that is a combination of
category "office support" and category "windows support."
Thus, the category represented by node 311 contains the
records of the categories "office support" and "windows
support." The root node 319 of the hierarchical map repre- 
sents a category that contains all the records in the collec- 
tion. In FIG. 3A, all the leaf nodes are displayed horizontally 
aligned. In contrast, FIG. 3B illustrates a hierarchical map in 
which the leaf nodes are not horizontally aligned. The CV 
system allows a data analyst to select whether to display the 
leaf nodes horizontally aligned. When the leaf nodes are 
horizontally aligned, it may be easier for the data analyst to 
visually identify the base categories. However, it may be 
more difficult for the data analyst to identify the
sub-categories of a combined category.

Many of the user interface features of the similarity 
network have analogous features in the hierarchical map. 
For example, FIG. 3C illustrates the de-emphasizing of a 
base category. In this example, the data analyst has selected 
to de-emphasize the node 301 representing base category 
"office support." The CV system de-emphasizes the node 
301 by dimming or removing it. FIG. 3D illustrates the 
de-emphasizing of a combined category. In this example, the 
data analyst has selected to de-emphasize node 316 repre- 
senting the combined category "support/enterprise." The 
data analyst can select to de-emphasize both the selected 
node and all its descendent nodes (i.e., the subtree with the 
selected node as its root) or only the descendent nodes. If a 
data analyst selects to de-emphasize a subtree, then the CV 
system can represent the subtree as a single node or can dim 
or remove the subtree. 
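
Selecting which nodes to dim or remove amounts to collecting a subtree. The Python sketch below assumes a hypothetical representation in which `children` maps each combined-category node to its two child nodes; the function names are likewise assumptions. The sample tree uses nodes 314, 311, 303, 301, and 302 as they are related in the bottom-up browsing example for FIGS. 3I-3K.

```python
def descendants(children, node):
    """Return every node strictly below `node` in the hierarchy."""
    found = set()
    stack = list(children.get(node, ()))
    while stack:
        current = stack.pop()
        found.add(current)
        stack.extend(children.get(current, ()))
    return found


def nodes_to_deemphasize(children, selected, include_selected=True):
    """Return the nodes to dim or remove for a de-emphasize request.

    include_selected=True  -> the whole subtree rooted at `selected`
    include_selected=False -> only the descendants of `selected`
    """
    nodes = descendants(children, selected)
    if include_selected:
        nodes.add(selected)
    return nodes


# Node 314 combines nodes 303 and 311; node 311 combines nodes 301 and 302.
CHILDREN = {314: (303, 311), 311: (301, 302)}
```

De-emphasizing a leaf such as node 301 returns only that node, which corresponds to the base-category case of FIG. 3C.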

When a data analyst moves a cursor over the nodes of a 
category graph, the CV system displays additional information
for the node. FIG. 3E illustrates the movement of the
cursor over a node of a hierarchical map. In this example, the
data analyst has moved a cursor over the node 309 representing
category "office advanced." In this example, the
complete name of the category is displayed. Alternatively, 
additional information about the node could be displayed, 
such as the number of records in the category. 

The CV system allows a data analyst to browse through 
a hierarchical map in either a top-down or bottom-up 
manner. The browsing displays the base categories and 
combined categories based on similarity. When browsing 
from the bottom up, the CV system displays nodes representing
combined categories (along with child nodes) in the
same order in which the combined categories were generated when
the hierarchical map was created. When browsing from the 
top down, the CV system displays the nodes representing 
combined categories in the reverse order. When browsing in 
a top-down manner, the CV system first displays the root 
node and its two child nodes because the root node repre- 
sents the combined category that was generated last. The CV 
system displays "next" and "previous" buttons for browsing 
down and up the hierarchy of nodes. Alternatively, the CV
system provides a slider that allows the data analyst to move 
forward ("next") and backward ("previous") for browsing 
up and down the hierarchy of nodes. In response to the data 
analyst selecting the "next" button, the CV system displays 
the child nodes representing the sub-categories of the dis- 
played node representing the combined category in reverse 
order in which the combined categories were generated.
Also, in response to a data analyst selection of the "previous"
button, the CV system removes the last child nodes
displayed. When browsing in a bottom-up manner, the CV
system first displays the node (and its child nodes) representing
the combined category that was generated first. In
response to the data analyst selection of "next node," the CV 
system displays the node (and child nodes if not already 
displayed) representing the combined category that was next 
generated. Also, in response to a data analyst selection of the 
"previous" button, the CV system removes the node(s) 
displayed most recently. The CV system supports browsing 
a hierarchical map that is displayed in either tree or circular 
format. 
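
The "next" and "previous" behavior can be sketched as a cursor over the order in which combined categories were generated. The `Browser` class below is a hypothetical Python illustration, not the CV system's code. The sample data follow the bottom-up example of FIGS. 3I-3K, in which nodes 313, 311, 312, and 314 are displayed in that order.

```python
class Browser:
    """Step through combined categories in generation order or its reverse."""

    def __init__(self, merge_order, children, top_down=True):
        # Top-down browsing visits combined categories in reverse generation order.
        self.steps = list(reversed(merge_order)) if top_down else list(merge_order)
        self.children = children
        self.position = 1 if self.steps else 0  # the first step is shown initially

    def next(self):
        """Reveal one more combined category, if any remain."""
        self.position = min(self.position + 1, len(self.steps))

    def previous(self):
        """Hide the most recently revealed combined category."""
        self.position = max(self.position - 1, 1 if self.steps else 0)

    def visible(self):
        """Return the nodes currently displayed (revealed nodes and their children)."""
        shown = set()
        for node in self.steps[: self.position]:
            shown.add(node)
            shown.update(self.children[node])
        return shown


CHILDREN = {313: (308, 309), 311: (301, 302), 312: (305, 306), 314: (303, 311)}
MERGE_ORDER = [313, 311, 312, 314]
```

Because `visible` is recomputed from the revealed steps, a child node that was already displayed by an earlier step (node 311 here) stays on screen when a later step is undone.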

FIGS. 3F-3K illustrate the browsing features of the CV 
system. The browsing features allow the user to incrementally
display the hierarchical map in either a top-down or a
bottom-up manner. When the user selects a top-down
browse, the root node 319 and its two child nodes 310 and
318 are displayed initially. At each request to browse down,
additional child nodes are displayed in the reverse order in 
which the child nodes were combined to generate combined 
categories. As shown in FIG. 3G, as the data analyst first 
requests to browse down, the CV system displays node 316 
representing the combined category "support/enterprise" 
and node 317 representing category "other." When the data 
analyst next requests to browse down, the CV system 
displays node 312 representing category "novice" and node 
315 representing category "advanced," which are child 
nodes of node 317 representing category "other." When the
data analyst then requests to browse down, the CV system
displays node 307 representing category "web tools" and
node 313 representing category "miscellaneous," which are
child nodes of node 315 representing category "advanced."
In this example, the data analyst has selected to recenter the
node that is being browsed down in the center of the display. 
Thus, node 315 is shown in the center of the display. 

When in browsing mode, the data analyst may select a 
node to display a list of various options for displaying 
information relating to the nodes. FIG. 3H illustrates the list
of options for a selected node. In this example, the data
analyst has selected node 315 representing category
"advanced." When the node is selected, the CV system
displays a pop-up window indicating the various options that 
may be selected by the user. Table 2 lists the options. 

TABLE 2

Node summary
Compare this node with parent
Compare this node with sibling
Compare this node to rest of the world
Compare this node with left child
Compare this node with right child
Compare the children of this node

A "node summary" includes more detailed information
about the category that the node represents. For example, the
node summary may include the number of records in the
category and the percentage of the records that have various
attribute values, which is referred to as characteristic information.
The "compare" options display similarity and discriminating
information between the selected category and
other categories. The discriminating information indicates
which attributes distinguish the records in the selected category
from records in other categories.
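
The record count and per-value percentages of a node summary can be computed directly from the records of a category. The Python sketch below assumes that each record is a dictionary of attribute values; the function name `node_summary` and the sample attribute names are hypothetical.

```python
from collections import Counter


def node_summary(records):
    """Return the record count and, per attribute, the percentage of records
    in the category that hold each value."""
    total = len(records)
    summary = {"records": total, "percentages": {}}
    if total == 0:
        return summary
    attributes = sorted({name for record in records for name in record})
    for attribute in attributes:
        counts = Counter(r[attribute] for r in records if attribute in r)
        summary["percentages"][attribute] = {
            value: 100.0 * count / total for value, count in counts.items()
        }
    return summary


# Hypothetical records for one category.
SAMPLE = [
    {"sex": "F", "visited_home": 1},
    {"sex": "M", "visited_home": 1},
    {"sex": "F", "visited_home": 0},
    {"sex": "F", "visited_home": 1},
]
```

For the sample above, three of the four records have `visited_home` equal to 1, so the summary reports 75 percent for that value.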

FIGS. 3I-3K illustrate browsing in a bottom-up manner. 
FIG. 3I illustrates the initial display in a bottom-up browse.
In this example, node 313 representing combined category
"miscellaneous" is displayed along with its child node 308
representing category "developer" and child node 309 representing
category "office advanced" because the combined
category "miscellaneous" was the first combined category 
generated when generating the hierarchical map. Each time 
the user selects the "next" button an additional combined 
category is displayed in the order that the combined catego- 
ries were generated. FIG. 3J illustrates the display of the 
hierarchical map after the user has selected the "next" button 
three times. When the data analyst selects the "next" button the
first time, then the CV system displays node 311 represent- 
ing the "support" category plus its child node 301 repre- 
senting category "office support" and child node 302 repre- 
senting category "windows support." When the data analyst 
selects the "next" button for the second time, then the CV 
system displays node 312 representing category "novice" 
and its child node 305 representing category "office" and 
child node 306 representing category "ie." When the data
analyst selects the "next" button for the third time, the CV
system displays node 314 representing category "support"
along with its child node 303 representing the category "ie
support." The other child node 311 representing combined 
category "support" is already displayed. FIG. 3K illustrates 
the selection of node 314 representing the category "sup- 
port." The data analyst may also use a slider to browse the 
hierarchy up or down rather than use the "previous" and 
"next" buttons. The CV system can also animate the brows- 
ing of the hierarchical maps. When animating the browsing 
in a bottom-up manner, the CV system progressively dis- 
plays the nodes from the bottom of the hierarchy towaids the 
top at, for example, periodic time intervals. When animating 
browsing in a top-down manner, the CV system displays the 
root node first and then di^lays additional nodes periodi- 
cally until the leaf nodes are displayed 

FIG. 4 illustrates a hierarchical map displayed in circular 
format. The leaf nodes of the hierarchy are displayed in a 
circle. In the center of the circle is displayed the root node 
of the hierarchy. The other non-leaf nodes are displayed 
between the root node and the surrounding leaf nodes. The 
same visualization features (e.g., browsing and 
de-emphasizing) that are used with the tree format can be 
used with the circular format of the hierarchical map. Also, 
similarity information can be displayed along with a hier- 
archical map by, for example, using different color arcs to 
connect nodes representing the base categories. Thus, a 
similarity graph is effectively superimposed on a hierarchi- 
cal map. 

The CV system displays additional information about
categories when requested by a data analyst. This additional
information includes characteristic and discriminating
information. FIGS. 5A-5C illustrate weights of evidence
information that may be displayed when a data analyst selects a
node of a category graph. The weights of evidence
information includes the identification of discriminating pages
and characteristic pages. FIG. 5A illustrates the display of
the characteristic pages of category "enterprise." The
characteristic pages list the web pages that are accessed by the
users in a category, ordered by the probability that a
user in the category accesses the web page. The probability
is equal to the number of users in the category who access
the web page divided by the number of users in the category.
The characteristic pages of category "enterprise" indicate
that a user in that category has a 0.915 probability of accessing
the "windows" web page. Also, a user in that category has
a 0.62 probability of accessing the "products" web page.
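
The ordering of characteristic pages described above can be sketched in a few lines. This is only an illustration: the function name `characteristic_pages`, the representation of each user as a set of page names, and the sample data are assumptions that do not come from the text.

```python
from collections import Counter

def characteristic_pages(users_in_category):
    """Rank pages by the fraction of users in the category who accessed them.

    users_in_category: list of sets, one set of page names per user
    (an assumed record layout, not one given in the text).
    """
    n_users = len(users_in_category)
    counts = Counter(page for pages in users_in_category for page in pages)
    # probability = users in the category who accessed the page
    #               divided by the number of users in the category
    ranked = [(page, count / n_users) for page, count in counts.items()]
    # highest probability first; ties broken alphabetically
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked

# Hypothetical category of four users.
enterprise = [
    {"windows", "products"},
    {"windows", "ntserver"},
    {"windows", "products", "ntserver"},
    {"windows"},
]
ranking = characteristic_pages(enterprise)
```

With this sample, "windows" is listed first because every user in the category accessed it.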
FIG. 5B illustrates the discriminating pages for the
category "enterprise." The top panel illustrates the web pages
that discriminate the category "enterprise" from all other
categories. The web pages are listed in order based on their
ability to discriminate all other categories. Web pages that
tend to be accessed by the users of a category and not accessed
by users of the other categories tend to be most
discriminating. In this example, the "windows" web page, the
"ntserver" web page, the "products" web page, and so on
serve to discriminate users in category "enterprise" from all
others. The bottom panel indicates the web pages that
discriminate all other categories from the "enterprise" category.
Web pages accessed by users of the other categories and not
accessed by users of a selected category tend to be most
discriminating. In this example, the "workshop" web page,
the "ie" web page, and so on are used to discriminate all of
the categories from the category "enterprise." An example
mathematical basis for discrimination is provided below.
FIG. 5C illustrates the display of pairwise discrimination
for two categories. In this example, the user has selected to
display information that tends to discriminate category
"office support" from category "ie support." As shown by
the top panel, the users of the category "office support" tend
to use the "office" web page, whereas users of category "ie
support" tend not to use the "office" web page. In contrast,
the users of the category "ie support" tend to use the "ie"
web page, whereas users of category "office support" tend
not to use that web page.

The CV system provides for displaying certain information
in a 3-D graphical form. FIG. 6A illustrates a 3-D graph
of the probability that each attribute equals 1 for each binary
attribute. The x-axis represents the categories, the y-axis
represents the attributes, and the z-axis represents the
probabilities. For example, the height of bar 601 represents the
probability (of approximately 0.1) that a record in category
1 will have a value of 1. In this example, the bars for a given
attribute are shown in the same color or shade. FIG. 6B
illustrates a 3-D graph of the same information as the graph
of FIG. 6A except that the bars for a given category, rather
than a given attribute, are shown in the same color or shade.
These graphs therefore allow a data analyst to focus on
attributes or categories.

The CV system also provides for displaying categories in
a decision tree format. FIG. 7 illustrates a decision tree
format for displaying the categories of a collection. The
decision tree 700 contains nodes corresponding to attributes
and arcs corresponding to values of that attribute. The
decision tree has node 701 corresponding to the attribute
indicating whether a user accessed the "workshop" web
page and arcs 701a and 701b indicating the values of zero
and non-zero for that attribute. Node 702 corresponds to the
attribute indicating whether a user accessed the "intdev"
web page, with arcs 702a and 702b indicating the values of 2
and not 2. Thus, each node, except the root node, represents
a setting of attribute values as indicated by the arcs in the
path from that node to the root node. When a data analyst
selects a node, the CV system displays a probability for each
category that a record in that category will have the attribute
settings that are represented by the path. For example, when
the data analyst selects node 703 representing the attribute
setting of accessing the "workshop" web page at least once
and accessing the "intdev" web page twice, the CV system
displays table 704. The table identifies the categories, the
number of records in each category that match those
attribute settings, and the probabilities. For example, the first
line "0 5 0.0039" indicates that category 0 has 5 records that
match the attribute settings and that the probability for
category 0 is 0.0039. The CV system generates the decision
tree by adding a column to a collection of records that
contains the category of each record. The CV system then applies
a decision tree algorithm (e.g., Chickering, D., Heckerman,
D., Meek, C., "A Bayesian Approach to Learning Bayesian
Networks with Local Structure," Proceedings of the Thirteenth
Conference on Uncertainty in Artificial Intelligence,
1997) to build a decision tree (or graph) in which the
category column represents the target variable.
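
The per-node table described above (category, number of matching records, probability) might be computed along the following lines. This is a sketch under stated assumptions: the text does not say how the probability column is derived, so it is taken here to be the matching count divided by the category size. The names `matches_path` and `node_table`, the dictionary record layout, and the sample data are hypothetical.

```python
def matches_path(record, settings):
    """True if the record satisfies every attribute test on the path to a node."""
    return all(test(record.get(attr, 0)) for attr, test in settings.items())

def node_table(records, categories, settings):
    """For each category, return (matching record count, matching fraction).

    Assumption: the probability shown for a category is the fraction of
    its records that match the attribute settings of the selected node.
    """
    totals, hits = {}, {}
    for record, cat in zip(records, categories):
        totals[cat] = totals.get(cat, 0) + 1
        if matches_path(record, settings):
            hits[cat] = hits.get(cat, 0) + 1
    return {cat: (hits.get(cat, 0), hits.get(cat, 0) / totals[cat])
            for cat in sorted(totals)}

# Settings for a node like node 703: "workshop" accessed at least once
# and "intdev" accessed exactly twice.
path = {"workshop": lambda v: v != 0, "intdev": lambda v: v == 2}

# Hypothetical access counts and category labels.
records = [
    {"workshop": 1, "intdev": 2},
    {"workshop": 0, "intdev": 2},
    {"workshop": 3, "intdev": 2},
    {"workshop": 1, "intdev": 0},
]
categories = [0, 0, 1, 1]
table = node_table(records, categories, path)
```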
Mathematical Basis 

The similarity in one embodiment corresponds to the
"distance" between the records in two categories. A
mathematical basis for a distance calculation is presented in the
following. In the following, $X_1, \ldots, X_m$ refers to the
variables representing the attributes and $x_1, \ldots, x_m$ refers to
the state of a variable, that is, the attribute values. First,
however, various probabilities are defined that are used to
calculate the distance. The probability of a record in a
collection having attribute values $x_1, \ldots, x_m$ is represented
by the joint probability density function of the following
equation:

$$p(x_1, \ldots, x_m) = \sum_j p(h_j)\, p(x_1, \ldots, x_m \mid h_j) \tag{1a}$$

where $h_j$ represents category j, where $p(h_j)$ represents the
probability that any record is in category j, and where
$p(x_1, \ldots, x_m \mid h_j)$ represents the conditional probability that a
record has attribute values $x_1, \ldots, x_m$ given that it is a record
from category j. The probability that a record is in category
j is given by the following equation:

$$p(h_j) = \frac{\mathrm{size}(h_j) + \alpha_j}{\sum_j \mathrm{size}(h_j) + \alpha_j} \tag{1b}$$

where $\mathrm{size}(h_j)$ is a count of the number of records in
category j, and the $\alpha_j$ are hyperparameters (e.g., $\alpha_j = 1$ for all
j). For example, if category j contains 10,000 records and the
collection contains 100,000 records, then $p(h_j) = 0.1$.

In one embodiment, it is assumed that the probability that
a record with attribute values $x_1, \ldots, x_m$ is in category j is
the product of the probabilities for each attribute value that
a record in category j has that attribute value and is given by
the following equation:

$$p(x_1, \ldots, x_m \mid h_j) = \prod_i p(x_i \mid h_j) \tag{1c}$$

where $p(x_i \mid h_j)$ is the conditional probability that a record has
the attribute value $x_i$ for attribute i given that it is in category
j. This probability is given by the following equation:

$$p(x_i \mid h_j) = \frac{\mathrm{size}(x_i, h_j) + \alpha_{ij}}{\sum_{x_i} \mathrm{size}(x_i, h_j) + \alpha_{ij}} \tag{1d}$$

where $\mathrm{size}(x_i, h_j)$ is the number of records in category j with
a value for attribute i that equals the attribute value $x_i$, where
the summation is over all values of attribute i and where $\alpha_{ij}$
are hyperparameters (e.g., $\alpha_{ij} = 1$, for all i and j). For
example, if category j contains 10,000 records and 100 of
those records have a value of 1 for attribute i, then $p(1 \mid h_j) =
0.01$. Equation (1a) can be rewritten by substituting
Equation (1c) as the following equation:

$$p(x_1, \ldots, x_m) = \sum_j p(h_j) \prod_i p(x_i \mid h_j) \tag{1e}$$

In one embodiment, the similarity, also known as
distance, between two categories is given by the sum of the
Kullback-Leibler (KL) distance between the records in the
first category and the records in the second category and the
KL distance between the records in the second category and
the records in the first category. The distance is given by the
symmetric divergence (H. Jeffreys, Theory of Probability,
Oxford University Press, 1939):

$$\mathrm{dist}(h_1, h_2) = \mathrm{KL}\bigl(p(X_1, \ldots, X_m \mid h_1),\ p(X_1, \ldots, X_m \mid h_2)\bigr) + \mathrm{KL}\bigl(p(X_1, \ldots, X_m \mid h_2),\ p(X_1, \ldots, X_m \mid h_1)\bigr) \tag{2a}$$

Equation (2a) reduces to the following equation:

$$\mathrm{dist}(h_1, h_2) = \sum_{x_1, \ldots, x_m} \bigl(p(x_1, \ldots, x_m \mid h_1) - p(x_1, \ldots, x_m \mid h_2)\bigr) \log \frac{p(x_1, \ldots, x_m \mid h_1)}{p(x_1, \ldots, x_m \mid h_2)} \tag{2b}$$

Thus, the distance between the first and second categories is
the sum for all possible combinations of attribute values of
a first probability that a record with that combination of
attribute values is in the first category minus a second
probability that a record with that combination of attribute
values is in the second category times the logarithm of the
first probability divided by the second probability. Since
Equation (2b) requires a summation over all possible
combinations of attribute values, the determination of the
similarity using this formula is computationally expensive.
When Equation (1c) is substituted into Equation (2b), the
result is the following equation:

$$\mathrm{dist}(h_1, h_2) = \sum_i \sum_{x_i} \bigl(p(x_i \mid h_1) - p(x_i \mid h_2)\bigr) \log \frac{p(x_i \mid h_1)}{p(x_i \mid h_2)} \tag{2c}$$

This equation requires only the summation over all possible
values of each attribute, and not over all possible
combinations of attributes, and is thus computationally much more
efficient.

Equation (2c) or, alternatively, Equation (2b) provides a
way to calculate the similarity for a pair of base categories.
Several different equations can be used to calculate the
similarity between two combined categories. For example,
when two categories are combined into a combined
category, then the similarity between the combined category
and every other category (combined or not combined) needs
to be calculated for the display of a similarity graph.
Equations (3a), (3b), and (3c) provide three different
techniques for calculating the similarities with combined
categories. The first technique averages the similarity between
each pair of categories of the first and second combined
categories and is given by the following equation:

$$\mathrm{dist}(G_1, G_2) = \sum_{h_j \in G_1,\ h_k \in G_2} p(h_j)\, p(h_k)\, \mathrm{dist}(h_j, h_k) \tag{3a}$$

where $G_1$ represents the first combined category and $G_2$
represents the second combined category. Thus, the distance
is the summation of the distances between each pair of
categories times the probabilities (Equation (1b)) that a
record is in each of the categories. The second and third
techniques calculate the distance as either the minimum or
maximum distance between any two pairs of categories in
the first and second combined categories and are given by
the following equations:

$$\mathrm{dist}(G_1, G_2) = \min\{\mathrm{dist}(h_j, h_k) \mid h_j \in G_1,\ h_k \in G_2\} \tag{3b}$$

$$\mathrm{dist}(G_1, G_2) = \max\{\mathrm{dist}(h_j, h_k) \mid h_j \in G_1,\ h_k \in G_2\} \tag{3c}$$

Another technique for calculating the distance is by
treating a combined category as a non-combined category
with the records of the sub-categories. This technique results
in the following equation:

$$\mathrm{dist}(G_1, G_2) = \sum_{x_1, \ldots, x_m} \bigl(p(x_1, \ldots, x_m \mid G_1) - p(x_1, \ldots, x_m \mid G_2)\bigr) \log \frac{p(x_1, \ldots, x_m \mid G_1)}{p(x_1, \ldots, x_m \mid G_2)} \tag{4a}$$

where $p(x_1, \ldots, x_m \mid G)$ is the conditional probability that a
record has attribute values $x_1, \ldots, x_m$ given that it is a record
from the combined category G. This probability is given by
the following equation:

$$p(x_1, \ldots, x_m \mid G) = \frac{\sum_{h_j \in G} p(h_j)\, p(x_1, \ldots, x_m \mid h_j)}{\sum_{h_j \in G} p(h_j)} \tag{4b}$$

where the denominator is the sum of the probabilities that
any record is in each category of G and the numerator is the
sum for each category j in G of the probability that the record
with attribute values $x_1, \ldots, x_m$ is in category j times the
probability that a record in the collection is in category j.
Equation (4a), however, cannot be factored in the same way
as Equation (2b), and thus the determination of the distance
between combined categories $G_1$ and $G_2$ is computationally
expensive because a summation over all possible
combinations of attribute values is needed. For example, if there are
10 attributes with approximately 5 possible attribute values
each, then there are approximately $10^7$ possible
combinations of attribute values. Therefore, in one embodiment, the
CV system approximates the distance using a Monte Carlo
method such as simple sampling from $G_1$ and $G_2$, where
$s_1, s_2, \ldots$ denote the samples from $G_1$, and where $t_1, t_2, \ldots$
denote the samples from $G_2$ (each $s_i$ and $t_i$ corresponds to the
observations $x_1, \ldots, x_m$ for all attributes). (E.g., Shachter and
Peot, "Simulation Approaches to General Probabilistic
Inference in Belief Networks," Uncertainty in Artificial
Intelligence 5, p. 221-231, 1990.) The CV system
approximates the distance between two combined categories by
taking the sample data sets and applying them to the
following:

[equation not legible in the source text]

where $p(s_i \mid G_j)$ and $p(t_i \mid G_j)$ are computed using Equation
(4b). The number of samples from $G_1$ and $G_2$ are taken in
proportion to $p(G_1)$ and $p(G_2)$, where $p(G_j)$ is the probability
that a record is in the set of categories defined by $G_j$.

This Monte Carlo method can be used to calculate the
distance between both base and combined categories when
Equation (2b) without the independence assumption is used
as a distance.

Another technique for calculating the distance is to
assume that the individual attributes are conditionally
independent given $G_1$, $G_2$, and the set of clusters not in $G_1$ union
$G_2$, yielding the formula

$$\mathrm{dist}(G_1, G_2) = \sum_i \sum_{x_i} \bigl(p(x_i \mid G_1) - p(x_i \mid G_2)\bigr) \log \frac{p(x_i \mid G_1)}{p(x_i \mid G_2)}$$

As discussed above, attribute-value discrimination refers
to how well the value of an attribute distinguishes the
records of one category from the records of another
category. One technique for calculating attribute-value
discrimination is given by the following formula, of which only the
leading factor is legible in the source text:

$$\mathrm{discrim}(x_i \mid G_1, G_2) = \bigl(p(x_i \mid G_1) - p(x_i \mid G_2)\bigr) \log \cdots \tag{6a}$$

where the probability that a record has a value of $x_i$ for
attribute i in combined category G is given by the following
equation:

$$p(x_i \mid G) = \frac{\sum_{h_j \in G} p(h_j)\, p(x_i \mid h_j)}{\sum_{h_j \in G} p(h_j)} \tag{6b}$$

Attribute-value discrimination scores can be positive,
negative, or zero. If the score $\mathrm{discrim}(x_i \mid G_1, G_2)$ is positive,
then the observation of the attribute value $x_i$ makes $G_1$ more
likely than $G_2$. If the score $\mathrm{discrim}(x_i \mid G_1, G_2)$ is negative,
then the observation of the attribute value $x_i$ makes $G_1$ less
likely than $G_2$. If the score $\mathrm{discrim}(x_i \mid G_1, G_2)$ is zero, then the
observation of the attribute value $x_i$ leaves the relative
probabilities of $G_1$ and $G_2$ the same. The last case almost
never occurs.

There are several possibilities for displaying the attribute
values and their corresponding discrimination scores. In one
embodiment, all attribute values are displayed such that (1)
the attribute values with positive and negative scores appear
in separate areas of the screen, and (2) the attribute values
with the largest scores (in absolute value) appear higher in
the list. Alternatively, the discrimination scores for all
attribute values except distinguished values (e.g., $x_i = 0$) are
displayed. Also, non-binary attributes may be binarized into
attributes that have only values 0 and non-zero before
displaying.

The homogeneity of a category indicates how similar the
records of the category are to one another. The homogeneity
is given by the following:

$$\mathrm{hom}(G) = \sum p(G \mid x_1, \ldots, x_m) \log p(x_1, \ldots, x_m \mid G) \tag{7}$$

where G represents a category or a combined category and
where $p(G \mid x_1, \ldots, x_m)$ is the probability that category G
contains the record with attribute values $x_1, \ldots, x_m$
(obtainable from Bayes rule).

Implementation

FIG. 8 illustrates the components of an embodiment of the
CV system. The CV system executes on computer system
800 which includes a central processing unit, memory, and
input/output devices. The CV system includes a collection
storage component 801, a categorizer component 802, a
category storage component 803, a user interface component
804, and analysis component 805. The collection storage
component contains the attribute value for each attribute of
each record in the collection. The categorizer component
inputs the records of the collection storage component and
identifies the various categories and stores the identification
of the categories in the category storage component. The
user interface component inputs data from the collection
storage component and the category storage component and
generates the various category graphs which are displayed
on display 806. The user interface component invokes the
analysis component to process the category storage
information. The layout of the nodes can be determined by a
variety of standard techniques for rendering graphs,
including planar layouts, or any other scheme for minimizing edge
crossings at display time.
FIG. 9 is a flow diagram of a routine for calculating the
similarity of base categories. The routine loops selecting
each possible pair of base categories and calculating the
similarity in accordance with Equation (2c) or Equation (2b)
without the independence assumption. One skilled in the art
will appreciate that many other distances for calculating the
similarity of categories can be used. For example, one could
use the average Hamming distance between records in each
category. In step 901, the routine selects a first category $h_1$.
In step 902, if all the categories have already been selected
as the first category, then the routine is done, else the routine
continues at step 903. In step 903, the routine selects a
second category $h_2$ for which the similarity between the first
and second categories has not yet been calculated. In step
904, if all such categories have already been selected, then
the routine loops to step 901 to select another first category,
else the routine continues at step 905. In step 905, the routine
calculates the similarity between the selected first and
second categories and loops to step 903 to select another second
category.
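
The pairwise loop of FIG. 9 can be sketched as follows, using the smoothed conditional probabilities of Equation (1d) and the factored distance of Equation (2c). This is a minimal illustration, not the patented implementation. It assumes attribute values are small integers, and it assumes the hyperparameter is added once per attribute value so that the probabilities for an attribute sum to one. The function names and sample records are hypothetical.

```python
import math
from itertools import combinations

def conditional_probs(records, n_values, alpha=1.0):
    """Smoothed p(x_i | h_j) for one category, in the spirit of Equation (1d).

    records: list of tuples of attribute values, each in range(n_values[i]).
    Assumption: alpha is added once per attribute value.
    """
    probs = []
    for i, k in enumerate(n_values):
        counts = [alpha] * k
        for rec in records:
            counts[rec[i]] += 1
        total = sum(counts)
        probs.append([c / total for c in counts])
    return probs

def distance(p1, p2):
    """Symmetric divergence under attribute independence, Equation (2c)."""
    return sum(
        (a - b) * math.log(a / b)
        for attr1, attr2 in zip(p1, p2)
        for a, b in zip(attr1, attr2)
    )

def pairwise_distances(categories, n_values):
    """Distance for every unordered pair of base categories."""
    probs = {name: conditional_probs(recs, n_values)
             for name, recs in categories.items()}
    return {(a, b): distance(probs[a], probs[b])
            for a, b in combinations(sorted(probs), 2)}

# Hypothetical categories with two binary attributes per record.
categories = {
    "a": [(1, 0), (1, 0), (1, 1)],
    "b": [(0, 1), (0, 1), (1, 1)],
    "c": [(1, 0), (1, 0), (1, 1)],
}
dists = pairwise_distances(categories, n_values=(2, 2))
```

Because the quantity computed is a distance, a smaller value means the two categories are more alike; categories "a" and "c" above have identical records and therefore a distance of zero.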

FIG. 10 is a flow diagram of a routine for displaying a
similarity graph. The routine displays a node for each base
category and then displays an arc between those nodes
representing categories whose similarity is above the
similarity threshold. In steps 1001-1003, the routine loops
displaying nodes for the categories. In step 1001, the routine
selects a category that has not yet been selected. In step
1002, if all the categories have already been selected, then
the routine continues at step 1004, else the routine continues
at step 1003. In step 1003, the routine displays a node
representing the selected category and loops to step 1001 to
select the next category. In steps 1004-1007, the routine
loops displaying the arcs. In step 1004, the routine selects a
pair of categories with a similarity above the similarity
threshold. In step 1005, if all such pairs of categories have
already been selected, then the routine is done, else the
routine continues at step 1006. In step 1006, the routine
determines the thickness of the arc to be displayed between
the selected pair of categories. In step 1007, the routine
displays an arc of the determined thickness between the
nodes representing the selected categories and loops to step
1004 to select another pair of categories.
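
The two loops of FIG. 10 can be illustrated with a small function that returns the nodes and the arcs to draw. This is a sketch under assumptions: the text says only that a thickness is determined for each arc, so the linear mapping from similarity to width used here is hypothetical, as are the function name, the `max_width` parameter, and the sample scores.

```python
def similarity_graph(categories, similarity, threshold, max_width=8.0):
    """Return (nodes, arcs) for a similarity graph.

    similarity: dict mapping frozenset({a, b}) to a score, where a higher
    score means the two categories are more similar.
    Assumption: arc width scales linearly with the score.
    """
    nodes = list(categories)                      # one node per category
    kept = {pair: s for pair, s in similarity.items() if s > threshold}
    top = max(kept.values(), default=0.0)
    arcs = []
    for pair, s in sorted(kept.items(), key=lambda kv: -kv[1]):
        a, b = sorted(pair)
        width = max_width * s / top if top > 0 else 0.0
        arcs.append((a, b, width))
    return nodes, arcs

# Hypothetical similarity scores for three categories.
scores = {
    frozenset({"ie", "office"}): 0.9,
    frozenset({"ie", "support"}): 0.3,
    frozenset({"office", "support"}): 0.6,
}
nodes, arcs = similarity_graph(["ie", "office", "support"], scores, threshold=0.5)
```

Raising or lowering `threshold` corresponds to moving the similarity slider: pairs at or below the threshold simply produce no arc.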

FIG. 11 is a flow diagram of a routine for generating a
hierarchical map. The routine starts with the base categories
and successively combines categories that are most similar.
In step 1101, the routine initializes a set of categories to
contain each base category. In step 1102, if the set contains
only one category, then the hierarchical map is complete and
the routine is done, else the routine continues at step 1103.
In step 1103, the routine selects the next pair of categories
in the set that are most similar. Initially, the similarities of
the base categories are calculated in accordance with the
routine of FIG. 9. In step 1104, the routine removes the
selected pair of categories from the set. In step 1105, the
routine adds a combined category formed by the selected
pair of categories to the set. In step 1106, the routine
calculates the similarity between the combined category and
every other category in the set according to Equation (5) and
loops to step 1102 to determine whether the set contains only
one category.
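
The merging loop of FIG. 11 can be sketched as follows. To keep the example short, distances involving combined categories are computed with the minimum rule of Equation (3b) rather than the equation named in step 1106; that substitution is an assumption made for illustration. The function name, the frozenset representation of categories, and the sample distances are likewise hypothetical.

```python
from itertools import combinations

def build_hierarchy(base_dist):
    """Combine the closest pair of categories until one category remains.

    base_dist: dict mapping frozenset({a, b}) to the distance between
    base categories a and b (smaller means more similar).
    Returns the merges, as (child1, child2) pairs, in generation order.
    """
    def group_dist(g1, g2):
        # Assumption: minimum rule of Equation (3b) for combined categories.
        return min(base_dist[frozenset({a, b})] for a in g1 for b in g2)

    bases = sorted({name for pair in base_dist for name in pair})
    current = [frozenset({name}) for name in bases]          # step 1101
    merges = []
    while len(current) > 1:                                  # step 1102
        g1, g2 = min(combinations(current, 2),
                     key=lambda pair: group_dist(*pair))     # step 1103
        current.remove(g1)                                   # step 1104
        current.remove(g2)
        current.append(g1 | g2)                              # step 1105
        merges.append((g1, g2))
        # step 1106 is implicit: group_dist recomputes distances on demand
    return merges

# Hypothetical distances between four base categories.
base_dist = {
    frozenset("ab"): 1.0, frozenset("ac"): 4.0, frozenset("ad"): 6.0,
    frozenset("bc"): 3.0, frozenset("bd"): 5.0, frozenset("cd"): 2.0,
}
merges = build_hierarchy(base_dist)
```

The order of `merges` is the order in which combined categories were generated, which is the order used for bottom-up browsing.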

FIG. 12 is a flow diagram of a routine to display a
hierarchical map. In step 1201, the routine selects a
combined category starting with the last combined category that
was generated. In step 1202, if all the combined categories
have already been selected, then the routine is done, else the
routine continues at step 1203. In step 1203, the routine
displays a node representing the selected combined
category. In step 1204, the routine displays an arc between the
displayed node and its parent node. In step 1205, the routine
displays a node representing any base sub-category of the
combined category along with connecting arcs. The routine
then loops to step 1201 to select the next combined category.
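
The display routine of FIG. 12 can be sketched as a function that turns a list of merges into an ordered list of drawing operations. This is an illustration under assumptions: the merge list format (pairs of frozensets in generation order), the tuple-based drawing operations, and the function name `display_order` are all hypothetical.

```python
def display_order(merges):
    """List drawing operations for a hierarchical map, root first.

    merges: list of (child1, child2) frozensets in generation order
    (an assumed representation of the combined categories).
    """
    parent = {}
    for c1, c2 in merges:
        parent[c1] = parent[c2] = c1 | c2
    ops = []
    for c1, c2 in reversed(merges):           # step 1201: last generated first
        node = c1 | c2
        ops.append(("node", node))            # step 1203
        if node in parent:                    # the root has no parent
            ops.append(("arc", parent[node], node))   # step 1204
        for child in (c1, c2):
            if len(child) == 1:               # step 1205: base sub-category
                ops.append(("node", child))
                ops.append(("arc", node, child))
    return ops

# Hypothetical hierarchy: "a" and "b" combined first, then joined with "c".
merges = [
    (frozenset("a"), frozenset("b")),
    (frozenset("ab"), frozenset("c")),
]
ops = display_order(merges)
```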

Although the present invention has been described in
terms of various embodiments, it is not intended that the
invention be limited to these embodiments, equivalents,
methods, structures, processes, and steps. Note that
modifications within the spirit of the invention fall within the
scope of the invention. The scope of the present invention is
defined by the claims that follow.

What is claimed is: 

1. A method in a computer system for displaying a 
representation of categories of data, the method comprising: 

for each category, determining a similarity of that cat- 
egory to every other category so that a similarity is 
determined for each pair of categories; 
displaying an indication of each category; and 
for each pair of categories, displaying an indication of the 
determined similarity of the pair of categories, wherein 
the displayed indication is an arc connecting the dis- 
played indication of each category in the pair of cat- 
egories. 

2. The method of claim 1 wherein the establishing of a
similarity threshold includes receiving from a user an
indication of the similarity threshold.

3. The method of claim 2 including displaying a slider so 
that the user can indicate the similarity threshold. 

4. The method of claim 3 wherein the slider is displayed 
horizontally. 

5. The method of claim 3 wherein the slider is displayed
vertically.

6. The method of claim 2 wherein when the user estab- 
lishes a new similarity threshold, adjusting the displayed 
indications of determined similarity to reflect the new simi- 
larity threshold. 

7. A method in a computer system for displaying a 
representation of categories of data, the method comprising: 

for each category, determining a similarity of that cat- 
egory to every other category so that a similarity is 
determined for each pair of categories; 

displaying an indication of each category; 

for each pair of categories, displaying an indication of the 
determined similarity of the pair of categories; 

establishing a similarity threshold; and 

displaying the indication of the determined similarity for
only those pairs of categories whose determined similarity
is above the established similarity threshold.

8. A method in a computer system for displaying a
representation of categories of data, the method comprising:

for each category, determining a similarity of that category
to every other category so that a similarity is
determined for each pair of categories;

displaying an indication of each category; and

for each pair of categories, displaying an indication of the
determined similarity of the pair of categories, wherein
the displayed indication is an arc and thickness of the
arc indicates the determined similarity between the pair
of categories.

9. A method in a computer system for displaying a 
representation of categories of data, the method comprising: 

for each category, determining a similarity of that cat- 
egory to every other category so that a similarity is 
determined for each pair of categories; 
displaying an indication of each category;

for each pair of categories, displaying an indication of the
determined similarity of the pair of categories;

receiving a selection of a displayed indication of a cat- 
egory; and 

in response to the selection, displaying information relat- 
ing to the category; 

wherein the data includes attributes and the information 
relating to the selected category identifies attributes that 
discriminate the selected category from another cat- 
egory. 

10. The method of claim 9 wherein the identified
attributes are ordered according to their ability to
discriminate.

11. The method of claim 9 wherein a discrimination
metric is given by the following equation, of which only the
leading factor is legible in the source text:

$$\mathrm{discrim}(x_i \mid G_1, G_2) = \bigl(p(x_i \mid G_1) - p(x_i \mid G_2)\bigr) \log \cdots$$

where $x_i$ represents a value of attribute i and where $p(x_i \mid G)$
represents the conditional probability that a record has an
attribute value $x_i$ given that the record is in category G.

12. The method of claim 11 wherein

$$p(x_i \mid G) = \frac{\sum_{h_j \in G} p(h_j)\, p(x_i \mid h_j)}{\sum_{h_j \in G} p(h_j)}$$

where $h_j$ represents category j, where $p(h_j)$ represents the
probability that a record is in $h_j$, and where $p(x_i \mid h_j)$ is the
conditional probability that a record has the value $x_i$ for
attribute i given that the record is in $h_j$.

13. A method in a computer system for displaying a 
representation of categories of data, the method comprising: 
for each category, determining a similarity of that cat- 
egory to every other category so that a similarity is 
determined for each pair of categories; 
displaying an indication of each category; 
for each pair of categories, displaying an indication of the
determined similarity of the pair of categories;
receiving an indication to de-emphasize a category; and
in response to the indication to de-emphasize a category,
de-emphasizing the displayed indication of the category.

14. The method of claim 13 including:

in response to receiving the indication to de-emphasize a
category, removing the displayed indication of the
de-emphasized category.

15. The method of claim 13 wherein the de-emphasizing 
is dimming of the displayed indication of the de-emphasized 
category. 

16. The method of claim 13 wherein the de-emphasizing 
is hiding of the displayed indication of the category. 

17. A method in a computer system for displaying a 
representation of categories of data, the method comprising: 

for each category, determining a similarity of that cat- 
egory to every other category so that a similarity is 
determined for each pair of categories; 

displaying an indication of each category; 

for each pair of categories, displaying an indication of the 
determined similarity of the pair of categories;

receiving an indication to split a combined category; and

in response to the indication to split a combined category,
displaying an indication of a pair of categories for the
combined category.

18. The method of claim 17 including: 

displaying a slider and wherein movement of the dis- 
played slider indicates to split a combined category. 

19. The method of claim 17 wherein the displaying an 
indication of a pair of categories includes displaying an 
animation of splitting the indication of the combined cat- 
egory into the pair of indications of categories. 

20. The method of claim 17 wherein the category to be 
split is the combined category that was last combined. 

21. The method of claim 17 including: 

displaying a control and wherein selection of the control 
indicates to split categories. 

22. A method in a computer system for displaying a 
representation of categories of data, the method comprising: 

receiving a hierarchical organization of the categories, the 
categories including a root category and leaf categories, 
each category except the leaf categories being a com- 
bined category; 

displaying an indication of each category in the hierar- 
chical organization; 

receiving an indication to de-emphasize a specific cat- 
egory; and 

in response to the indication to de-emphasize a category,
de-emphasizing the displayed indications of categories
in a sub-tree of which the specific category is a root.

23. The method of claim 22 wherein the
de-emphasizing is removing of the displayed indications.

24. The method of claim 22 wherein the de-emphasizing 
is dimming of the displayed indications. 

25. The method of claim 22 wherein the de-emphasizing 
is hiding of the displayed indications. 

26. A method in a computer system for displaying a 
representation of categories of data, the method comprising: 

receiving a hierarchical organization of the categories, the 
categories including a root category and leaf categories, 
each category except the leaf categories being a com- 
bined category; 

displaying an indication of each category in the hierar- 
chical organization; 

receiving an indication to de-emphasize a specific cat- 
egory; and 

in response to the indication to de-emphasize a category, 
removing all displayed indications of categories in a 
sub-tree of which the specific category is a root, excluding
the displayed indication of the category corresponding
to the root of the sub-tree.

27. A method in a computer system for displaying a
representation of categories of data, the method comprising:

receiving a hierarchical organization of the categories, the
categories including a root category and leaf categories,
each category except the leaf categories being a combined
category;

displaying an indication of each category in the
hierarchical organization;

receiving a selection of a displayed indication of a
category; and

in response to the selection, displaying information
relating to the selected category.

28. The method of claim 27 wherein the information is
comparison information with a category that is a parent of
the selected category.

29. The method of claim 27 wherein the information is
comparison information with a category that is a sibling of
the selected category.

30. The method of claim 27 wherein the information is
comparison information with all categories other than the
selected category.

31. The method of claim 27 wherein the information is
comparison information with child categories of the selected
category.

32. The method of claim 27 wherein the information is
weights of evidence information.

33. The method of claim 27 wherein the data includes
attributes and the information relating to the selected
category identifies attributes that discriminate the selected
category from another category.

34. The method of claim 33 wherein the identified
attributes are ordered according to their ability to
discriminate.

35. The method of claim 34 wherein a discrimination
metric, reflecting the identified attributes, is given by the
following equation:

$$\mathrm{discrim}(x_i \mid G_1, G_2) = \bigl(p(x_i \mid G_1) - p(x_i \mid G_2)\bigr)\log\frac{p(x_i \mid G_1)}{p(x_i \mid G_2)} + \bigl(p(x_i \mid G_2) - p(x_i \mid G_1)\bigr)\log\frac{1 - p(x_i \mid G_1)}{1 - p(x_i \mid G_2)}$$

where $x_i$ represents a value of attribute i and where
$p(x_i \mid G)$ represents the conditional probability that a
record has attribute value $x_i$ given that the record is in
category G.

36. The method of claim 35 wherein

$$p(x_i \mid G) = \frac{\sum_{j \in G} p(h_j)\, p(x_i \mid h_j)}{\sum_{j \in G} p(h_j)}$$

where $h_j$ represents category j, where $p(h_j)$ represents the
probability that a record is in $h_j$, and where $p(x_i \mid h_j)$ is
the conditional probability that a record has the value $x_i$
for attribute i given that the record is in $h_j$.

37. The method of claim 27 wherein the data includes
attributes and wherein the information relating to the
selected category identifies attributes that are characteristic
of the selected category.

38. The method of claim 27 wherein the information
relating to the selected category indicates the homogeneity
of the category.

39. The method of claim 27 wherein a homogeneity is
given by the following equation:

$$\mathrm{hom}(G) = \sum p(G \mid x_1, \ldots, x_n)\,\log p(x_1, \ldots, x_n \mid G)$$

where G represents a category or combined category, where
$p(G \mid x_1, \ldots, x_n)$ represents the probability that category G
contains the record with attribute values $x_1, \ldots, x_n$, and
where $p(x_1, \ldots, x_n \mid G)$ represents the conditional probability
that a record has attribute values $x_1, \ldots, x_n$ given that it is
in category G.

40. The method of claim 27 wherein the displayed
information relates to the similarity of sub-categories of a
combined category.

41. A method in a computer system for displaying a
representation of categories of data, the method comprising:

receiving a hierarchical organization of the categories, the
categories including a root category and leaf categories,
each category except the leaf categories being a combined
category; and

displaying an indication of each category in the
hierarchical organization;

wherein the displayed indication of each category
includes an indication of the number of records in said
category.

42. The method of claim 41 wherein the indication of the
number of records is shading of a portion of the displayed
indication of the category in proportion to the number of
records in the category to the total number of records in all
categories.

* * * * *
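
The discrimination metric of claim 35 and the combined-category probability of claim 36 can be illustrated numerically. The following is a minimal sketch, not the patented implementation. The function names `combined_prob` and `discrimination` are hypothetical, all probabilities are assumed to lie strictly between 0 and 1, and the prior-weighted form used for p(x_i|G) is an assumption based on the definitions of h_j, p(h_j), and p(x_i|h_j) recited in claim 36.

```python
import math


def combined_prob(priors, conds):
    """Sketch of p(x_i | G) for a combined category G.

    priors: p(h_j) for each base category j in G (hypothetical input).
    conds:  p(x_i | h_j) for the same categories, in the same order.
    Returns the prior-weighted average of the conditional probabilities.
    """
    total = sum(priors)
    return sum(p * c for p, c in zip(priors, conds)) / total


def discrimination(p1, p2):
    """Sketch of discrim(x_i | G1, G2) from claim 35.

    p1 = p(x_i | G1), p2 = p(x_i | G2); both assumed strictly in (0, 1).
    """
    return ((p1 - p2) * math.log(p1 / p2)
            + (p2 - p1) * math.log((1 - p1) / (1 - p2)))


if __name__ == "__main__":
    # Combined category G1 built from two base categories.
    p_g1 = combined_prob([0.2, 0.3], [0.9, 0.4])
    # A single base category serves as G2.
    p_g2 = 0.2
    print(p_g1, discrimination(p_g1, p_g2))
```

The metric is zero when both categories assign the attribute value the same probability, is symmetric in its two categories, and grows as the probabilities diverge, which is what allows attributes to be ordered by their ability to discriminate as recited in claim 34.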
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